Abstract

Background: Discovering sequence patterns with variation can unveil functions of a protein family that are 
important for drug discovery. Exploring protein families using existing methods such as multiple sequence 
alignment is computationally expensive, thus pattern search, called motif finding in Bioinformatics, is used. 
However, at present, combinatorial algorithms result in large sets of solutions, and probabilistic models require a 
richer representation of the amino acid associations. To overcome these shortcomings, we present a method for 
ranking and compacting these solutions in a new representation referred to as Aligned Pattern Clusters (APCs). To 
tackle the problem of a large solution set, our method reveals a reduced set of candidate solutions without losing 
any information. To address the problem of representation, our method captures the amino acid associations and 
conservations of the aligned patterns. Our algorithm renders a set of APCs in which a set of patterns is discovered, 
pruned, aligned, and synthesized from the input sequences of a protein family. 

Results: Our algorithm identifies the binding or other functional segments and their embedded residues, which are 
important drug targets, from the cytochrome c and the ubiquitin protein families taken from UniProt. The results 
are independently confirmed by pFam's multiple sequence alignment. For the cytochrome c protein, the number of 
resulting patterns with variations is reduced by 76.62% from the number of original patterns without variations. 
Furthermore, all of the top four candidate APCs correspond to the binding segments, each with one of its 
conserved amino acids as the binding residue. The discovered proximal APCs agree with pFam and PROSITE results. 
Surprisingly, the distal binding site discovered by our algorithm is discovered by neither pFam nor PROSITE, but 
is confirmed by the three-dimensional cytochrome c structure. When applied to the ubiquitin protein family, our 
results agree with pFam and reveal six of the seven Lysine binding residues as conserved aligned columns with 
an entropy redundancy measure of 1.0. 

Conclusion: The discovery, ranking, reduction, and representation of a set of patterns are important for avoiding 
time-consuming and expensive simulation and experimentation during proteomic studies and drug discovery. 



Introduction 

A key concern in healthcare is the major human diseases 
of the decade, e.g. cancer [1], Alzheimer's disease, and 
SARS [2]. Researchers are actively pursuing solutions to 
address these diseases. During drug discovery, it is crucial 
to identify proteins as drug targets and validate their 
functionality [3]. Binding sites are typically the central 
function of a protein, and therefore, recognizing them is 
essential in protein function analysis. Although each 
protein of the same protein family performs the same 
function, there are variations amongst the amino acids 
on each primary sequence. Hence, the conserved amino 
acid associations on the protein sequences from one pro- 
tein family reflect important functions. For example, a 
significant functionality of the cytochrome c protein is to 
bind the heme ligand from its binding sites [4], and the 
iron atom in the heme ligand is bonded by two binding 
residues, one for each side. In our experiments, it was 
found that each of these two binding residues is con- 
tained in a binding segment, which is represented by a 
sequence pattern with variations. The binding of cytochrome c 
and its release from the mitochondria has been 
shown to prevent cell death for cancer treatment [5]. 
Similarly, the ubiquitin protein contains seven binding 
residues that are also surrounded by binding segments. 
These binding residues and segments function by linking 
individual ubiquitins to create a unique poly-ubiquitin 
that can be recognized by other ubiquitins. Linking of 
these binding proteins is directly involved in the control 
of cancer progression [6]. A common approach to study- 
ing a protein family's function is finding the sequence 
patterns that have variations. Functional patterns can 
mutate through evolution [7,8]; thus each occurrence of 
the pattern may not be an exact repeat at the same loca- 
tion. Hence it is difficult to find and locate the segments 
that embed the binding residues. Figure 1 illustrates sim- 
pler patterns that might occur in the consensus region of 
a protein family. The example contains six text patterns 
embedded five times each in 30 input sequences and will 
be referred to as the illustrative example throughout the 
paper. 

Figure 1: An intuitive example from the cytochrome c 
protein showing parts of the protein sequence that 
represent the binding sites. 

In Bioinformatics, two common approaches for identi- 
fying the protein family's function are by multiple 
sequence alignment and by motif finding. Multiple 
sequence alignment aligns a set of protein sequences 
from the same protein family in order to identify important 
regions and sites in the resulting alignment. Common 
multiple sequence alignment tools include Clustal 
Omega [9], T-Coffee [10], DIALIGN [11] and HMMER 
[12]. However, finding the globally optimal alignment is 
computationally expensive, and is known in computational 
complexity as an NP-complete problem [13]. 
Even with approximate heuristics added, multiple 
sequence alignment is not efficient in handling large 
datasets. Moreover, this approach is only appropriate for 
highly similar sequences, but not for sequences with 
considerable dissimilarity. Therefore, instead of aligning 
the entire sequence globally, it is more suitable to identify 
similarities locally. Thus, the suspected consensus 
regions have to be located and preprocessed ahead of 
alignment. 

Another approach for identifying the protein family's 
function by similar local subsequences [14] is called 
motif finding, which builds motifs into combinatorial 
models and probabilistic models. The combinatorial 
model identifies commonly repeated sequence patterns 
exhaustively [15-17]. Work reported in Pevzner et al. 
[18] and Mandoiu et al. [19] created cliques where ver- 
tices are sequence patterns, edges connect similar 
sequence patterns, and complete graphs represent the 
best consensus patterns. However, these combinatorial 
methods are computationally intensive [20,21] and pro- 
duce too many possible candidates. The probabilistic 
model commonly uses the position weight matrix 
(PWM), which estimates an amino acid at each position 
while assuming that each position is independent 
[22,23]. An alternative random sequence synthesis takes 
further frame-shifted position into consideration by 
optimally aligning amino acids to create a probabilistic 
sequence [24,25]. Other probabilistic methods make use 
of the Markov model, where the current state depends 
on a specified set of the past states. One such example 
is the popular pFam database [26], which builds a profile 
Hidden Markov Model (HMM) from the multiple 
sequence alignment of a protein family for classifying 
proteins and predicting their functionality. In general, 
the probabilistic models compress the data into prob- 
ability distributions and express amino acid associations 
as an ordered set of random variables. 

To overcome these limitations, we approached the 
problem from a data mining perspective where we first 
considered the occurrences and strength of the sequence 
patterns. We began by identifying a set of statistically 



strong sequence patterns and developed an Aligned Pattern 
(AP) Synthesis Process to align and cluster similar 
patterns into a reduced set of Aligned Pattern Clusters 
(APCs) for representing the similar sequence patterns 
that might be associated with binding segments. These 
APCs capture both the statistically significant sequence 
association of amino acids and their conservation 
on each of the aligned columns. More precisely, our 
APC Process aligns and groups similar sequence patterns 
with variations to form clusters of Aligned Patterns 
called APCs. We then examined whether or not 
the APCs correspond to the binding segments and binding 
residues that reflect the protein's functionality. This 
paper is an expansion of Lee et al. [27], extending it 
with the reduction and ranking of the results. The 
three ranking measures presented are coverage, quality, 
and standard residual. 

When our APC Process was applied to the cytochrome 
c and ubiquitin protein families, we discovered a reduced 
set of APC solutions, which correspond to the functional 
binding segments and binding residues of both 
families. Our APC Process obtained a smaller set of 
solutions than the combinatorial methods, and rendered 
a more compact yet knowledge-rich representation, 
in the form of the APC, than the probabilistic 
methods. Having a smaller set of richer representations is 
crucial in identifying the drug targets for drug discovery. 

Methodology 

This section introduces the mathematical notations and 
definitions required to describe the APC Process and 
the APC as well as the dual composition of amino acids 
in the original input data space and in the compact pat- 
tern space, both of which are used for calculating mea- 
sures for revealing pattern characteristics. Our Aligned 
Pattern Clustering Process, as illustrated by the text 
example, undergoes two steps (Figure 2): (1) the Pattern 
Discovery Step (PD Step), and (2) the Aligned Pattern 
Clustering Step (APC Step). The PD Step discovers the 
most significant and non-redundant amino acid associa- 
tions as sequence patterns amongst the family of 
sequences. Next, the APC Step groups and aligns these 
discovered patterns into APCs, even though the occur- 
rences of the pattern start at different positions in their 
input sequences, thus consensus regions do not need to 
be constrained nor specified. A glossary of terms and 
mathematical notation is available as Additional file 1 to 
complement the definitions in the Methodology section. 

Figure 2: A text example using the English alphabet 
illustrates the problem of sequence patterns with variations. 
It was created to demonstrate each step of the 
process succinctly. This text example will be repeated 
throughout the paper. The overall APC Process contains 
two steps: the PD Step, and the APC Step. The final 
result is a list of APCs ordered by their ranking. 

The input sequences 

To begin, the input sequence is built from the alphabet 
$\Sigma$, which contains a set of characters $\{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|-1}, \sigma_{|\Sigma|}\}$. 
As an example, the English alphabet contains 26 characters, 
{'a', 'b', ..., 'y', 'z'} $= \Sigma$; mathematically, $\sigma_1 =$ 'a', 
$\sigma_2 =$ 'b', ..., $\sigma_{25} =$ 'y', $\sigma_{26} =$ 'z', and $|\Sigma| = 26$. 

A single sequence Let $s^k$ be a sequence indexed by $k$ 
composed of consecutive elements taken from the 
alphabet $\Sigma$: $s^k = s^k_1 s^k_2 \ldots s^k_{|s^k|-1} s^k_{|s^k|}$, where each $s^k_j \in \Sigma$ 
and $s^k$ is of length $|s^k|$. For example, 
aaaaaaaaaaaaHELLOaaaaaaaaaaaa is a sequence of length 29. This 
sequence can be represented by $s^1$, where $|s^1| = 29$, and 
the character at position 13 is $s^1_{13} =$ H. 

A set of sequences Let $\mathbb{S} = \{s^k \mid k = 1, \ldots, |\mathbb{S}|\} = \{s^1, s^2, \ldots, s^{|\mathbb{S}|-1}, s^{|\mathbb{S}|}\}$ 
be the set of sequences that represents the set of the input 
sequences, also called the data space, where $|\mathbb{S}|$ is the total 
number of input sequences, and the sequences have 
lengths $|s^1|, |s^2|, \ldots, |s^{|\mathbb{S}|-1}|, |s^{|\mathbb{S}|}|$ respectively. Let each 
sequence, say sequence $k$, be $s^k = s^k_1 \ldots s^k_j \ldots s^k_{|s^k|}$, where 
$s^k_j \in \Sigma$ is the element at position $j$ of sequence $k$. Together, 
the data space is the set of sequences 

$$
\mathbb{S} =
\begin{pmatrix} s^1 \\ s^2 \\ \vdots \\ s^{|\mathbb{S}|} \end{pmatrix}
=
\begin{pmatrix}
s^1_1 s^1_2 s^1_3 \ldots s^1_{|s^1|} \\
s^2_1 s^2_2 s^2_3 \ldots s^2_{|s^2|} \\
\vdots \\
s^{|\mathbb{S}|}_1 s^{|\mathbb{S}|}_2 s^{|\mathbb{S}|}_3 \ldots s^{|\mathbb{S}|}_{|s^{|\mathbb{S}|}|}
\end{pmatrix}
\qquad (1)\text{-}(6)
$$


The pattern discovery step 

The PD Step is a previously developed pattern discovery 
and pruning algorithm [28] that obtains a condensed list 
of significant patterns from the family of protein 
sequences. 

The pattern In this paper, we consider a pattern as a 
statistically significant and non-redundant pattern as 
defined in Wong et al. [28]. 

Definition 1 A pattern $p^i = s_j s_{j+1} \ldots s_{j+|p^i|-1}$ is a 
short sequence over $\Sigma$ where $|p^i|$ is its order (or length). 



Lee and Wong Proteome Science 2013, 11(Suppl 1):S8 
http://www.proteomesci.com/content/11/S1/S8 




Figure 2 (panel labels): Multiple Sequences; Step 1, Pattern Discovery; Sequence Patterns; Step 2, Pattern Clustering (Iterative Clustering); Aligned Pattern Cluster; Step 3, Verification & Interpretation (pFam Alignment + 3D Structure); Product of Process. 



The sequence association is statistically significant and 
non-redundant in the sense that it is delta-closed (i.e. it 
is not covered by a statistically significant super-pattern) 
and non-induced (i.e. its statistical significance is not 
induced by its statistically strong sub-patterns). An 
unaligned pattern $p^i$ is discovered by passing four 
statistical conditions defined in Wong et al. [28]. 

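An occurrence of the pattern $p^i$ in sequence $s^k$ is expressed as 

$$ occ(p^i) = j_k \ \text{such that} \ p^i = s^k_{j_k} s^k_{j_k+1} \ldots s^k_{j_k+|p^i|-1}, $$

where $k$ is the index of the sequence in which the 
pattern occurs, and $j_k$ is the starting index in that 
sequence $s^k$ where the pattern begins. Within the data space, 
the occurrences of $p^i$ are the bracketed segments 

$$
\begin{pmatrix}
s^1_1 \ldots \, [\, s^1_{j_1} s^1_{j_1+1} \ldots s^1_{j_1+|p^i|-1} \,] \, \ldots s^1_{|s^1|} \\
s^2_1 \ldots \, [\, s^2_{j_2} s^2_{j_2+1} \ldots s^2_{j_2+|p^i|-1} \,] \, \ldots s^2_{|s^2|} \\
\vdots \\
s^{|\mathbb{S}|}_1 \ldots \, [\, s^{|\mathbb{S}|}_{j_{|\mathbb{S}|}} s^{|\mathbb{S}|}_{j_{|\mathbb{S}|}+1} \ldots s^{|\mathbb{S}|}_{j_{|\mathbb{S}|}+|p^i|-1} \,] \, \ldots s^{|\mathbb{S}|}_{|s^{|\mathbb{S}|}|}
\end{pmatrix}
\qquad (7), (8)
$$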



The text example (Table 1) displays two patterns corresponding 
to our definition. The dataset contains two 
functional patterns, HELLO and MELLOW, which are English 
words embedded in ten input sequences $\mathbb{S} = \{s^1, \ldots, s^{10}\}$. 
The letters outside the patterns are stochastically generated 
from the 26 characters in the English alphabet, which 
are independently and identically distributed. 

Data induced by the unaligned pattern Let $D(p^i)$ 
be the set of unique occurrences of the pattern, $p^i$, 
found in the input sequences. We call $D(p^i)$ the data 
induced by $p^i$ or the induced data of $p^i$. We will return 
to the concept of the data induced by a pattern when we 
use it to compute the measures for aligned columns 
within the context of the APC. 

Table 1 Example of patterns $p^1$ = HELLO and $p^2$ = MELLOW. 

$\mathbb{S}$, the input sequences (10) 

bdxejrtewkwkHELLOkcmstsjavtpi 
nfixtHELLOuzdovcaaxnkjfjcvwk 
dimtndvkjmkHELLObkcmstsj 
tzhgarzofdHELLOpwkxmc 
tyjxjqnyHELLOwmopemlqfgptnwnq 
kntywtoaxMELLOWbtiasycma 
jilxchitivMELLOWriiiweyfzgvuyaa 
hmlzvMELLOWorgfeb 
xhmlzvqgcanyMELLOWgbfj 
vqgcanyffcMELLOWvcnsnjvalbdvr 
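As a minimal sketch of how the induced data of a pattern could be collected for the text example, the Python snippet below scans each input sequence for exact occurrences of a pattern and records the sequence index and the starting position. The function name `induced_data`, the 1-based indexing, and the removal of the spaces that appear in the printed sequences are assumptions made for illustration; the PD Step of [28] additionally applies statistical tests that are not reproduced here.

```python
from typing import List, Tuple

# Input sequences of the text example (Table 1). Spaces in the printed
# table are assumed to be typesetting artifacts and are removed here.
SEQUENCES: List[str] = [
    "bdxejrtewkwkHELLOkcmstsjavtpi",
    "nfixtHELLOuzdovcaaxnkjfjcvwk",
    "dimtndvkjmkHELLObkcmstsj",
    "tzhgarzofdHELLOpwkxmc",
    "tyjxjqnyHELLOwmopemlqfgptnwnq",
    "kntywtoaxMELLOWbtiasycma",
    "jilxchitivMELLOWriiiweyfzgvuyaa",
    "hmlzvMELLOWorgfeb",
    "xhmlzvqgcanyMELLOWgbfj",
    "vqgcanyffcMELLOWvcnsnjvalbdvr",
]


def induced_data(pattern: str, sequences: List[str]) -> List[Tuple[int, int]]:
    """Return (k, j) pairs: 1-based sequence index k and 1-based start j."""
    hits: List[Tuple[int, int]] = []
    for k, seq in enumerate(sequences, start=1):
        j = seq.find(pattern)
        while j != -1:
            hits.append((k, j + 1))
            # Continue one position later so overlapping occurrences are kept.
            j = seq.find(pattern, j + 1)
    return hits


hello_hits = induced_data("HELLO", SEQUENCES)
mellow_hits = induced_data("MELLOW", SEQUENCES)
print(hello_hits)
print(mellow_hits)
```

Exact string matching is used here only because the two embedded words occur without variation; patterns containing gaps or wildcards would need a different matcher.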

The aligned pattern clustering step 

For the APC Step, we developed an algorithm that gathers 
a set of similar patterns of different lengths obtained 
from the PD Step while aligning them into patterns of 
the same length by inserting gaps and wildcards. Constrained 
by the statistical sequence association, the corresponding 
elements in this cluster of patterns are 
lined up into columns, thus reflecting their conservation 
and variation [27]. 

In this paper, the APC Step is a single-linkage hierarchical 
clustering algorithm that takes a list of patterns 
as input and synthesizes, or more precisely aligns and 
groups, the patterns into one or more APCs based on 
a similarity measure between APCs. Using the text 
example, Figure 3 illustrates one iteration of the hierarchical 
clustering algorithm. More precisely, it shows 
the last step of the iterative merge between APC $C^1$ and 
APC $C^2$, thereby creating the new APC $C^3$. 

Figure 3: In one iterative step of hierarchical clustering, 
an existing APC, $C^1$ with m = 3 and n = 6, is 
merged with another APC, $C^2$ with m = 3 and n = 5, to 
result in the new APC, $C^3$, which is extended to m = 6 
and n = 6. 

Definition 2 A set of APCs is $\mathcal{C} = \{C^l \mid l = 1, \ldots, |\mathcal{C}|\} = \{C^1, C^2, \ldots, C^{|\mathcal{C}|-1}, C^{|\mathcal{C}|}\}$. 

An APC, $C^l$, is a set of similar horizontal sequence patterns 
that have been optimally grouped and vertically 
aligned into a set of patterns $P^l = \{p^1, p^2, \ldots, p^m\}$ represented 
by $C^l$, which is expressed as 

$$ C^l = \mathrm{ALIGN}(P^l), \qquad (12) $$

$$
= \begin{pmatrix} p^1 \\ p^2 \\ \vdots \\ p^m \end{pmatrix}
= \begin{pmatrix}
s^1_1 & s^1_2 & \ldots & s^1_n \\
s^2_1 & s^2_2 & \ldots & s^2_n \\
\vdots & \vdots & & \vdots \\
s^m_1 & s^m_2 & \ldots & s^m_n
\end{pmatrix}, \qquad (13)
$$

$$ = (c_1 \; c_2 \; \ldots \; c_n), \qquad (14) $$

where $s^i_j \in \Sigma \cup \{-\} \cup \{*\}$ is an amino acid in the pattern, 
$p^i$, in an aligned column $j$. Each pattern of $C^l$ is 
aligned into length $|C^l| = n$, and there is a set of 
$|P^l| = m$ patterns (rows) in $C^l$. 

In the text example in Figure 3, $C^1$ with m = 3 and 
n = 6, is merged with another APC, $C^2$ with m = 3 and 
n = 5, to result in the new APC, $C^3$, which is extended to 
m = 6 and n = 6. 

Definition 3 An Aligned Pattern, which will simply 
be referred to as a pattern from this point forward, is a 
sequence of order-preserving amino acids maximizing 
the similarity of the pattern against a set of patterns 
from an APC, $P^l$ of size $|P^l| = m$, with gaps, wildcards, 
and mismatches. Let $p^i \in P^l$ be $s^i_1 s^i_2 \ldots s^i_n$, where 
$s^i_j \in \Sigma \cup \{-\} \cup \{*\}$ is an amino acid in the pattern $p^i$ 
and in the aligned column index $c_j$. 

Definition 4 An aligned column $c_j$ in $C^l$ represents the 
$j$-th column of amino acids that have been aligned from the 
set of patterns contained in the current APC, $C^l = (c_1 \; c_2 \; \ldots \; c_n)$. 
A conserved aligned column is conserved to only 
one type of amino acid such that $c_j = [\sigma \ldots \sigma \ldots \sigma]^T$ where 
$\sigma \in \Sigma$. 

For the text example, the APC Step creates an APC containing 
six patterns with six aligned columns (Table 2). 
The APC is obtained from the alignment of a set of similar 
patterns, where each row is a pattern from $P^l$ and each 
column is an aligned column of amino acids. Here, the 
pattern for the third row is $p^3$ = HELLO and the aligned 
column for the first position is $c_1 = [\text{B M H B B H}]^T$. 
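The relationship between rows (patterns) and aligned columns can be sketched in a few lines of Python. This is only an illustration: the helper names `aligned_columns` and `is_conserved` are hypothetical, the three patterns are modeled on the first rows of the text example, and treating a column as conserved only when it holds a single alphabet symbol (neither gap nor wildcard) is an assumption drawn from Definition 4.

```python
from typing import List

GAP, WILDCARD = "-", "*"


def aligned_columns(patterns: List[str]) -> List[str]:
    """Transpose m aligned patterns of length n into n column strings."""
    n = len(patterns[0])
    if any(len(p) != n for p in patterns):
        raise ValueError("aligned patterns must all have the same length")
    return ["".join(p[j] for p in patterns) for j in range(n)]


def is_conserved(column: str) -> bool:
    """A column is conserved when it contains exactly one alphabet symbol."""
    symbols = set(column)
    return len(symbols) == 1 and not (symbols & {GAP, WILDCARD})


# Three aligned patterns modeled on the text example (m = 3, n = 6).
apc = ["BELLOW", "MELLOW", "HELLO*"]
columns = aligned_columns(apc)
conserved = [j + 1 for j, c in enumerate(columns) if is_conserved(c)]
print(columns)    # one string per aligned column
print(conserved)  # 1-based indices of conserved columns
```

In this small case the first column varies (B, M, H) and the last column mixes W with a wildcard, so only the four middle columns are conserved.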

Data induced by APC Let $D(C^l)$ be the data induced by 
the APC $C^l$, which is the subset of segments from the 
input sequences, or the data subspace containing all 
the patterns from the APC, $C^l$, where its corresponding 
$P^l = \{p^1, p^2, \ldots, p^m\}^T$. We call $D(C^l)$ the data induced 
by $C^l$ or the induced data of $C^l$. $D(C^l)$ is then 
the union of the segments from the input sequences 
induced by all the patterns contained in $C^l$: 

$$ D(C^l) = D(p^1) \cup D(p^2) \cup \cdots \cup D(p^m) = \bigcup_{\forall p^i \in P^l} D(p^i) $$

Table 2 Example of an APC for the text example. 

p^i \ c_j   c1  c2  c3  c4  c5  c6     (1 x 6) 
p^1         B   E   L   L   O   W 
p^2         M   E   L   L   O   W 
p^3         H   E   L   L   O   * 
p^4         B   A   L   L 
p^5         B   A   L   K 
p^6         H   A   L   S   * 
(6 x 1) 
Measuring and ranking results 
The three measures of APCs 

In order to rank the set of constructed APCs, $\mathcal{C}$, three 
measures are computed for each APC, $C^l$. The three 
measures are Coverage, APC Quality, and Standard 
Residual. 

Coverage The coverage of an APC accounts for the 
total number of input sequences that are covered by the APC, $C^l$, 
over the entire set of input sequences. Note that this 
also counts the number of occurrences in the induced 
data space $D(C^l)$. 
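One way to read the coverage definition is sketched below in Python. The text does not say whether coverage is reported as a count or as a fraction of the input, so the hypothetical helper `coverage` returns both; it also assumes, for simplicity, that the patterns contain no gaps or wildcards and can be located by exact substring search. The toy sequences are invented for the example.

```python
from typing import List, Set, Tuple


def covered_sequences(patterns: List[str], sequences: List[str]) -> Set[int]:
    """1-based indices of sequences containing at least one pattern."""
    return {
        k
        for k, seq in enumerate(sequences, start=1)
        if any(p in seq for p in patterns)
    }


def coverage(patterns: List[str], sequences: List[str]) -> Tuple[int, float]:
    """Return (number of covered sequences, fraction of all sequences)."""
    count = len(covered_sequences(patterns, sequences))
    return count, count / len(sequences)


# Toy data: two patterns and four short sequences (illustrative only).
toy_sequences = ["xxHELLOxx", "yMELLOWy", "zzzz", "HELLOMELLOW"]
count, fraction = coverage(["HELLO", "MELLOW"], toy_sequences)
print(count, fraction)
```

A sequence that contains several patterns of the same APC is counted once, which corresponds to taking the union of the data induced by the individual patterns.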

APC Quality The APC Quality, $Q$, is the average column 
entropy subtracted from one, where entropy is 
computed from the set of Aligned Patterns, $p^i \in C^l$. 
The APC Quality measures the stability or reliability of 
an APC, whereas the entropy measures the randomness 
or the degree of variation within an APC. The value of 
$Q$ approaches one as the resulting APC becomes more 
stable, and approaches zero as the resulting 
APC becomes more random. $Q$ is expressed as: 

$$ Q = 1 - \frac{1}{n} \sum_{j=1}^{n} H(c_j), \qquad (15) $$

where $c_j$ is the aligned column in the resulting APC, 

$$ H(c_j) = - \sum_{\forall \sigma \in c_j} \Pr(c_j = \sigma) \log \Pr(c_j = \sigma), \qquad (16) $$

$$ \Pr(c_j = \sigma) = \frac{|\{p^i \in P^l : s^i_j = \sigma\}|}{m}, \qquad (17) $$

where $\sigma \in \Sigma \cup \{-\} \cup \{*\}$ is the amino acid $s^i_j$ of $p^i$ at $c_j$, 
and the probability $\Pr(c_j = \sigma)$ is computed from counting 
the subset of patterns in $P^l$. 
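The quality measure can be illustrated with a short Python sketch. It assumes a base-2 logarithm, which the text does not specify, and counts gaps and wildcards as ordinary symbols; the function names are hypothetical. Under these assumptions, two patterns that differ in one of six columns give Q = 1 - 1/6, about 0.83, which is the value listed in Table 3 for the APCs with m = 2 and n = 6.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (base 2, an assumption) of the symbols in a column."""
    m = len(column)
    h = sum((c / m) * math.log2(c / m) for c in Counter(column).values())
    return -h if h else 0.0


def apc_quality(patterns: List[str]) -> float:
    """Q = 1 - average column entropy over the n aligned columns."""
    n = len(patterns[0])
    columns = ["".join(p[j] for p in patterns) for j in range(n)]
    return 1.0 - sum(column_entropy(c) for c in columns) / n


print(apc_quality(["AAAA"]))              # a single pattern has no variation
print(apc_quality(["ABCD", "ABCE"]))      # one varying column out of four
print(apc_quality(["ABCDEF", "ABCDEG"]))  # one varying column out of six
```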

Standard Residual The Standard Residual measures 
the statistical significance of the APC by comparing the 
actual number of occurrences, $o$, of all the patterns 
included in the APC, against the expected number of 
occurrences, $e$, which is computed from the default random 
model of the APC. It is written as 

$$ \mathrm{StandardResidual} = \frac{o - e}{\sqrt{e}}, \qquad (18) $$

where $o$ is the actual number of occurrences of the 
patterns in $P^l$ counted from the input data, $D(C^l)$, and $e$ 
is the expected number of occurrences computed from 
the default random model of the APC, $C^l$, by assuming that 
each of the aligned columns $c_j$ is independent and 
identically distributed (i.i.d.), as shown below: 

$$ e = E[C^l], \qquad (19) $$

$$ = N(\Pr(C^l)), \qquad (20) $$

$$ = N(\Pr(c_1)\Pr(c_2)\ldots\Pr(c_n)), \qquad (21) $$

$$ = N \prod_{j=1}^{n} \Pr(c_j), \qquad (22) $$

where $N$ is the length of the input sequence and each of 
the aligned columns $c_j \in C^l$ is i.i.d. To compute the default 
probability of an aligned column, $\Pr(c_j)$, sum the probabilities 
of all the possible amino acids in that single 
aligned column, $\Pr(c_j) = \sum_{\sigma_k \in c_j} \Pr(c_j = \sigma_k)$, where $\Pr(c_j = \sigma_k) = \frac{1}{|\Sigma|}$ since 
each $\sigma_k$ is i.i.d. Returning to the text example with 6 patterns 
and 6 aligned columns and the English alphabet, 
$\Pr(c_1) = \Pr(c_1 = \text{B}) + \Pr(c_1 = \text{M}) + \Pr(c_1 = \text{H}) = \frac{3}{26}$. 
Therefore, the final expectation is 

$$ e = N \prod_{j=1}^{n} \sum_{\sigma_k \in c_j} \frac{1}{|\Sigma|}. \qquad (23) $$
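The expectation under the default random model, and the resulting residual, can be sketched in Python as follows. The sketch assumes that the residual takes the usual form (o - e) / sqrt(e), that every distinct symbol in a column (including any gap or wildcard) contributes 1/|Σ| to the column probability, and that e is strictly positive; the function names and the toy numbers are hypothetical.

```python
import math
from typing import List


def column_probability(column: str, alphabet_size: int) -> float:
    """Pr(c_j): each distinct symbol in the column contributes 1/|Sigma|."""
    return len(set(column)) / alphabet_size


def expected_occurrences(
    patterns: List[str], alphabet_size: int, total_length: int
) -> float:
    """e = N * product over aligned columns of Pr(c_j)."""
    n = len(patterns[0])
    e = float(total_length)
    for j in range(n):
        column = "".join(p[j] for p in patterns)
        e *= column_probability(column, alphabet_size)
    return e


def standard_residual(observed: float, expected: float) -> float:
    """(o - e) / sqrt(e); expected must be greater than zero."""
    return (observed - expected) / math.sqrt(expected)


# First aligned column of the text example over the English alphabet.
print(column_probability("BMHBBH", 26))  # 3 distinct letters out of 26

# Toy case: one 2-symbol pattern, 4-letter alphabet, N = 160, o = 20.
e = expected_occurrences(["AB"], alphabet_size=4, total_length=160)
print(e, standard_residual(20, e))
```

Because e shrinks geometrically with the number of columns, long APCs with few distinct symbols per column receive very large residuals.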



The redundancy measure of the aligned columns 

The Redundancy Measure [29] indicates the specificity 
or stability of the amino acids in an aligned column 
based on the frequency of the occurrences of the amino 
acids taken from that aligned column of its $D(C^l)$. The 
Redundancy Measure $R1(c_j)$ for the aligned column $c_j$ is 

$$ R1(c_j) = 1 - H(c_j), \qquad (24) $$

where $H(c_j)$ is computed with $\Pr(c_j = \sigma)$ obtained from counting 
$\sigma$ in the aligned column, $c_j$, over the induced data, $D(C^l)$, 
of the input sequences. Hence, a conserved aligned column has 
$R1(c_j) = 1$, since the minimum entropy value is $H(c_j) = 0$. Similarly, 
a variable aligned column has $R1(c_j) = 0$ at the maximum 
entropy value of $H(c_j) = 1$, which is reached if the amino acid 
occurrences in $D(C^l)$ are equiprobable. 
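A small Python sketch of the Redundancy Measure is given below. For the entropy to reach exactly 1 when the symbols are equiprobable it has to be normalized; dividing by the logarithm of the alphabet size is an assumption made here, since the text does not state the normalization. The function name `redundancy` and the example columns are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def redundancy(column: Sequence[str], alphabet_size: int) -> float:
    """R1 = 1 - H, with H normalized by log(alphabet_size) (an assumption)."""
    total = len(column)
    h = -sum(
        (c / total) * math.log(c / total) for c in Counter(column).values()
    )
    return 1.0 - h / math.log(alphabet_size)


# A fully conserved column taken from the induced data scores 1.
print(redundancy("KKKKKKKK", alphabet_size=20))

# A column that is equiprobable over a 4-letter alphabet scores about 0.
print(redundancy("ACGT", alphabet_size=4))
```

Unlike the APC Quality, the column passed in here is meant to hold one symbol per occurrence in the induced data rather than one symbol per pattern.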

Note that the entropy of the Redundancy Measure is 
computed from the induced data, $D(C^l)$, 
whereas the entropy of the APC Quality uses the amino 
acids from the patterns in $P^l$. This is because the quality 
of the APC measures how much variation or stability is 
in the patterns, whereas the redundancy of the aligned 
column measures how much redundancy or consistency 
is in the induced data. 

Results and discussions 

We applied our APC Process on the cytochrome c and 
the ubiquitin protein families in order to examine how 
the resulting APCs relate to the binding sites, which are 
the biologically significant regions of the protein. There 
are three aspects we would like to explore: the reduction 
of the set of candidate solutions from the discovered 
patterns to APCs obtained; how each pattern in the 
APC surrounding the binding site represents a binding 
segment in a single strand of protein; and how binding 
residues correlate to their aligned column. Finally, we 
display our results underneath the pFam multiple 
sequence alignment to compare the differences in the 
representations. In the comparison, we demonstrate the 
overall hierarchical clustering performance of our APC 
Process as well as the quality of the resulting APCs. 

Cytochrome C results 

First, we demonstrated that by grouping similar patterns 
together, the APC reduces the number of candidate 
solutions to be examined without losing information. 
Next, we showed that in the binding APCs, each pattern 
represents a binding segment in the protein sequence 
and each of the two binding sites is represented by a 
specific aligned column. The 317 sequences from the 
cytochrome c protein family were obtained on September 
17th, 2012 from UniProt by searching the following 
terms: cytochrome c; AND reviewed:yes; AND name:c*; 
AND mnemonic:c*; AND (name:cytochrome AND 
name:c); NOT name:type; NOT name:VPR; NOT name:biogenesis; 
NOT name:*ase; NOT (name:cytochrome 
AND name:b*); NOT like; NOT proba*; AND fragment:no; 
AND active:yes. These parameters were selected 
to yield a reasonable number of input sequences 
for the APC Process. From these 317 input sequences, 
the PD Step was executed with a minimum order of 5, 
a minimum occurrence of 20, and a delta of 0.9. 
The PD Step discovered 154 patterns from the cytochrome 
c protein family, where 28 patterns, or 18.18% 
of the total patterns, contain the proximal binding site, 
His18, and 23 patterns, or 14.94% of the total patterns, 
contain the distal binding site, Met62, resulting in a 
combined total of 33.12% of the discovered patterns that 
contain one of the two binding sites. Therefore, the set 
of patterns redundantly covers the two binding sites. 
This observation indicates that each individual pattern 
alone covers only a small fraction of the input sequences 
in the data space; therefore, a single pattern by itself 
cannot fully represent the rich variations of all the input 
sequences within the entire protein family. Hence, the 
APC, which contains a set of similar patterns that have 
been grouped and aligned to allow variations, provides a 
reduced and much richer representation of the binding 
segments and binding residues. 

In the APC Step, we showed that our APC Process 
reduced the number of candidate solutions without losing 
any information and richly captured the binding sites in the 
compact APCs, where the binding segments are the patterns 
therein and the binding sites are the conserved 
aligned columns. We ensured that all the patterns discovered 
are strongly statistically significant by starting with a tighter 
configuration to ensure the quality of the result. From this 
list of 154 statistically significant and non-redundant 
patterns obtained from the previous PD Step, the APC 
Step was executed with the following settings: the Merge 
Algorithm as Global Alignment, the Similarity Score as 
Hamming Distance, the Termination Condition as a score 
less than 0.8, the heuristic column distribution score 
greater than 0.8, and a minimum of three overlapping 
column matches. 
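For readers unfamiliar with the Hamming distance used as the similarity score, a minimal Python sketch follows. How the APC Step turns this distance into the score that is compared against the 0.8 threshold is not described here, so the normalized `match_fraction` below is only an assumed illustration, as is the plain character comparison that treats gaps and wildcards like any other symbol.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def match_fraction(a: str, b: str) -> float:
    """Fraction of matching positions (a hypothetical normalized score)."""
    return 1.0 - hamming_distance(a, b) / len(a)


# Two aligned patterns from the text example, padded to the same length.
d = hamming_distance("HELLO*", "MELLOW")
f = match_fraction("HELLO*", "MELLOW")
print(d, f)
```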

We found the following results (Table 4): five of the 
discovered APCs (13.89% of the total number of APCs) 
contain the proximal binding site, His18; five APCs 
(13.89% of the total number of APCs) contain the distal 
binding site, Met62; and together they account for 27.78% of the APCs. 
This observation indicates that, while retaining the full 
information, the 154 patterns were reduced to 36 APCs, 
a total reduction of 76.62% for documentation and 
visualization. 
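The reduction percentages quoted above follow from a simple ratio, reproduced in the Python sketch below for the figures given in the text and in Table 4. The helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(n_patterns: int, n_apcs: int) -> float:
    """Percentage by which the candidate set shrinks, rounded to 2 decimals."""
    return round(100.0 * (1.0 - n_apcs / n_patterns), 2)


overall = percent_reduction(154, 36)  # all patterns -> all APCs
proximal = percent_reduction(28, 5)   # His18 patterns -> His18 APCs
distal = percent_reduction(23, 5)     # Met62 patterns -> Met62 APCs
print(overall, proximal, distal)
```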

As can be seen in Table 5, the top four resulting APCs 
correspond to the proximal and distal binding segments 
of the cytochrome c protein family. More specifically, 26 
proximal patterns were reduced to the two top APCs (a 
92.31% reduction) and 16 distal patterns were reduced to 
the two top APCs (an 87.50% reduction), for a combined 
reduction of 88.10% for these top four APCs. 

Cytochrome C discussion 

Biologically, the two binding residues in the cytochrome 
c protein that bind the heme ligand are the proximal 
binding residue, which binds the heme ligand from the 
proximal side of the protein [4,30], and the distal binding 
residue, which binds the heme from the other side of 
the protein [31]. The proximal and distal binding sites 
are located in the protein and bind the heme ligand 
from above and below the horizontal plane, respectively. 
Specifically, one particular amino acid from each of the 
two protein segments binds the iron atom located in 
the centre of the heme: the "H" (Histidine) residue at 
position 18, which is on the proximal side of the protein 
sequence, and the "M" (Methionine) residue at position 
62, which is on the distal side of the protein sequence. 
These two binding residues, His18 and Met62, are also 
confirmed by the three-dimensional structure, PDB ID 
1F1F, of the cytochrome c protein (Figure 4). Our 
results showed that the set of APCs discovered by our 
APC Step contained the protein binding sites, which carry 
out the main biological function of the protein. In fact, the four 
top resulting APCs precisely correspond to these crucial 
binding segments, which contain conserved aligned columns 
corresponding to the binding residues. 

Figure 4: One three-dimensional structure from the 
cytochrome c protein family, PDB ID 1F1F, is displayed. 
The top-two statistically significant APCs from the 
cytochrome c protein are the proximal binding segment 
(in pink) and the distal binding segment (in blue) that 
bind the heme from above and below the horizontal 
plane, respectively. More specifically, one specific amino 
acid from each of the two segments binds the iron atom 
from the centre of the heme: the "H" (Histidine) 
residue at position 18 of the proximal segment and the 
"M" (Methionine) residue at position 62 of the distal 
segment. 

Table 3 The 36 APCs of the Cytochrome C Family Ranked by Standard Residual (where m = the number of patterns in 
the APC, and n = length of the APC). 

#   APC (as regular expressions)          m    n    Quality  Coverage  Standard Residual  Binding Site 
1   WGEDTLMEYLENPKKYIPGTKMIFAGIKKK        8    30   0.57     81        5.92E+16           Met62 
2   MGDVEKGKKIFVQ[KR]CAQCHTVEKGGKHKTGPNL  19   33   0.43     119       5.04E+16           His18 
3   QCHTVEKGGKHiaGPNLHGLFGRKTGQA          7    28   0.41     46        8.32E+14           His18 
4   TLYDYLLNPKKYIPGTKM[VA]FPGLKKPQ        8    27   0.44     116       1.91E+14           Met62 
5   GAGHK[QVT]GPNL[NH]GLFGRQSGTT          13   21   0.4      125       3.53E+10 
6   GFSYTDANKNKGITWGE                     8    17   0.41     66        6.33E+08 
7   GEKIFiaKCAQCHTV                       3    15   0.57     24        6.45E+07           His18 
8   MGDVEKGKKIFVQKC                       7    15   0.4      53        5.04E+07 
9   GPNLHGLFGRKTGQA                       -1   15   0.43     46        4.37E+07 
10  ERADLIAYLK[KE]ATNE                    9    15   0.4      91        3.53E+07 
11  HGLFGRKFGQAPGF                        9    14   0.46     70        2.10E+07 
12  IPGTKMAFGGLKK                         4    13   0.42     136       9.06E+06           Met62 
13  AANKNKGITWGE                          4    12   0.5      54        1.60E+06 
14  LHGLFGR[QK]SGTF                       6    12   0.42     88        1.07E+06 
15  AGYSYSAANKN                           5    11   0.43     30        1.40E+05 
16  TLYDYLLNP                             2    9    0.56     29        2.69E+04 
17  GQAPGFSY                              2    8    0.5      27        5.57E+03 
18  TKMVFAG                               2    7    0.57     52        3.38E+03           Met62 
19  GGKHI^G                               2    7    0.43     64        2.94E+03 
20  EKGKKIF                               2    7    0.43     62        2.85E+03 
21  FAGLKKP                               3    7    0.48     57        2.62E+03 
22  WGGGKIY                               2    7    0.71     27        2.48E+03 
23  FAGIKKK                               2    7    0.43     51        2.34E+03 
24  YLKKAT                                1    6    1        29        1.19E+03 
25  WGEDTL                                1    6    1        25        1.02E+03 
26  NCAACH                                2    6    0.83     30        8.68E+02           His18 
27  KGAGHK                                2    6    0.83     26        7.52E+02 
28  KGITW                                 1    5    1        49        4.46E+02 
29  GFSYT                                 1    5    1        42        3.83E+02 
30  FVQKC                                 1    5    1        39        3.55E+02 
31  DANKN                                 1    5    1        34        3.10E+02 
32  GYSYT                                 1    5    1        28        2.55E+02 
33  AMPAF                                 1    5    1        24        2.19E+02           Met62 
34  CHAGG                                 1    5    1        22        2.00E+02           His18 
35  FKTRC                                 1    5    1        20        1.82E+02 
36  LFEYL                                 1    5    1        20        1.82E+02 

Table 4 Comparing the Number of APCs and Patterns. 

         Patterns Count  % overall  APCs Count  % overall  % Reduction 
His18    28              18.18%     5           13.89%     82.14% 
Met62    23              14.94%     5           13.89%     78.26% 
Total    154             33.12%     36          27.78%     76.62% 

Table 5 Comparing the Top Four APCs and their Patterns. 

         Pattern Count  APCs Count  % Reduction 
His18    26             2           92.31% 
Met62    16             2           87.50% 
Total    42             4           88.10% 

The ten APCs that correlate to the two binding sites 
were first clustered based on their horizontal patterns in 
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Figure 4 One three-dimensional structure from the cytochrome c protein family, PDB ID 1F1F, is displayed The top-two statistically 
significant APCs from the cytochrome c protein are the proximal binding segment (in pink) and the distal binding segment (in blue) that bind 
the heme from above and below the horizontal plane, respectively. More specifically, one specific amino acid from each of the two segments 
binds the iron molecule from the centre of the heme: the "H" (Histidine) residue at position 18 of the proximal segment and the "M" 
(Methionine) residue at position 62 of the distal segment. 



their rows and are then aligned into their aligned columns
that reveal their vertical stability. Firstly, each
APC contains a set of conserved patterns that are similar
to one another. Although these patterns suggest
their horizontal significance in the protein family, individually
they do not identify the significance of the amino
acid's conservation and variation. Thus, the aligned columns
are important for identifying the stability of the
binding residue. Secondly, the aligned columns of each
binding APC show the conservation of aligned columns
in the cluster, which otherwise is not easily seen in the
individual non-variable patterns. For example, consider
the top two APCs that correspond to the proximal
(Table 7) and distal (Table 6) binding segments. In
the tables, the columns in bold are the conserved
aligned columns with R1 = 1.0, where R1 reflects the
specificity of the residue of the site in the APC. The
aligned columns corresponding to the binding sites of
the APCs have an R1 value of 1.0, that is, the amino
acid for that aligned column is conserved in the data
space. To give a precise example, consider the proximal
APC that is ranked second. This APC has three



Table 6 The Distal APC of the Cytochrome C Family.

Patterns                        Count  Score
WGEDTLMEYLENPKKYIPGTKMIF******  22     1.94E+03
***DTLMEYLENPKKYIPGTKM********  26     1.30E+03
*******EYLENPKKYIPGTKMIFAGIKK*  35     2.54E+02
****TLMEYLENPKKYIPGTKMIFAGIKKK  29     7.34E+02
****TLMEYLENPKKYIPGTKMIFAG****  34     4.81E+01
********YLENPKKYIPGTKM********  81     6.51E+02
*******EYLENPKKYIPGTKMIFAG****  42     5.44E+01
*******EYLENPKKYIPGTKM********  65     2.88E+01

conserved aligned columns with an R1 value of 1.0: Gln16,
Cys17, and His18. The His18 conserved aligned column
is the proximal binding residue, and Cys17 binds an
adjacent corner on the heme ligand. Similarly, the conserved
aligned column representing Met62 in the distal
APC acts as the distal binding residue. The other conserved
aligned columns can be used to identify other
important functions in the protein.
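The idea of reading conservation off the aligned columns can be illustrated with a small sketch. It is not the paper's R1 measure, whose definition is given in the Methodology section; as a stand-in assumption it reports, for each aligned column, the most frequent residue and its fraction among the rows that are not wildcards (`*`). The function name `column_conservation` is hypothetical, and the four rows are a subset of the proximal APC patterns in Table 7.

```python
from collections import Counter

WILDCARD = "*"


def column_conservation(patterns):
    """Return, per aligned column, (most common residue, fraction) or None.

    The fraction is computed over non-wildcard entries only. This is a
    simple proxy for conservation, NOT the paper's R1 definition.
    """
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("all aligned patterns must have the same length")
    result = []
    for column in zip(*patterns):
        residues = [r for r in column if r != WILDCARD]
        if not residues:
            result.append(None)  # column covered by no pattern
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


# Four rows of the proximal APC (subset of Table 7).
rows = [
    "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTG",
    "*************KCAQCH***********",
    "*************RCAQCHT**********",
    "****************QCHTV*********",
]
scores = column_conservation(rows)

# Columns whose covered rows all agree (fraction 1.0).
fully_conserved = [i for i, s in enumerate(scores) if s and s[1] == 1.0]
```

In this subset, the column holding the K/R variation has a fraction below 1.0, while the columns holding the Q, C and H of the QCH run are fully conserved, in line with the Gln16, Cys17 and His18 columns discussed above.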

By matching the individual APCs up to the indepen- 
dent HMM alignment of pFam (Figure 5), we confirmed 
the validity of our set of 36 APCs. In addition, our 



Table 7 The Proximal APC of the Cytochrome C Family.

Patterns                        Count  Score
******GKKIFVQKCAQCHTV*********  23     6.27E+04
****EKGKKIFVQKCAQCHT**********  23     1.32E+04
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTG  20     7.50E+07
******GKKIFVQKCAQCHTVEKGGKHKTG  20     1.16E+06
*************KCAQCH***********  57     1.59E+01
**************CAQCH***********  89     2.58E+03
*************RCAQCHT**********  21     1.38E+01
**************CAQCHT**********  76     3.01E+01
**********FVQKCAQCHTVE********  27     5.88E+02
************QKCAQCHT**********  32     6.38E+01
************QKCAQCHTVEKGGKHKTG  23     6.33E+04
*************KCAQCHTVEKG******  30     4.91E+01
*************KCAQCHTV*********  51     1.73E+01
**************CAQCHTV*********  65     3.10E+01
**************CAQCHTVEK*******  34     1.30E+01
**************CAQCHTVE********  49     2.41E+01
****************QCHTV*********  95     2.33E+03
****************QCHTVEKGG*****  45     1.75E+01
****************QCHTVE********  77     3.15E+01

Figure 5: Ten resulting APCs representing the proximal and distal binding segments of the cytochrome c are
compared to the HMM logo from pFam. In the largest APC, Cys17 is identified as one of the conserved aligned
columns, where His18 binds to the heme iron. In the second largest APC, Met62 is identified as one of the
conserved aligned columns of the distal binding segment, where Met62 binds the heme iron.



proximal APC for cytochrome c is consistent with the
proximal binding motif [C]-x(2)-[CH] from PROSITE
(PDOC00169) [32,33] and with a strong emission probability
in pFam (PF00034) [26,34]. Moreover, our method
strongly identified the distal binding site in our APCs, whereas
PROSITE does not annotate this binding site and pFam
identifies it only with a weak emission probability.
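To make the comparison with the PROSITE motif concrete, the following sketch translates a PROSITE-style pattern into a Python regular expression and searches for it in the longest pattern of the proximal APC (Table 7). The helper `prosite_to_regex` is hypothetical and covers only the syntax needed here (literal residues, `x`, bracketed sets, braces for exclusions, and repeat counts); anchors and other PROSITE features are left out.

```python
import re

# One PROSITE element: a residue, x, [set] or {excluded set}, with an
# optional repeat count such as (2) or (2,4).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


motif = prosite_to_regex("[C]-x(2)-[CH]")

# Longest pattern of the proximal APC (Table 7).
proximal = "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTG"
hit = re.search(motif, proximal)
```

Here `hit.group()` is the CAQC stretch that immediately precedes the histidine of the proximal binding segment.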

In conclusion, the APC can represent protein functions 
such as the binding segments and binding residues and 
presents a reduced set of candidate solutions and specifies 
their location in the protein family. In cytochrome c, the 
prevention of binding can block cancer progression, which 
is an important drug discovery for cancer treatment. 
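The reduction percentages reported in Table 4 can be checked with a few lines of arithmetic. The sketch assumes that the reduction is computed as 1 minus the ratio of APCs to patterns, which reproduces the published values; the function name `reduction_percent` is illustrative only.

```python
def reduction_percent(pattern_count, apc_count):
    """Percentage by which the APCs reduce the number of patterns."""
    return 100.0 * (1.0 - apc_count / pattern_count)


# (patterns, APCs) pairs taken from Table 4.
table4 = {"His18": (28, 5), "Met62": (23, 5), "Total": (154, 36)}
reductions = {
    site: round(reduction_percent(patterns, apcs), 2)
    for site, (patterns, apcs) in table4.items()
}
```

With these inputs the dictionary holds 82.14 for His18, 78.26 for Met62, and 76.62 overall.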

Ubiquitin results 

To further study the APC Step, we closely examined the
iterative steps and the resulting APCs using the ubiquitin
protein family. The 70 sequences from the ubiquitin
protein family used in our experiment were obtained on
August 9th, 2012 from Uniprot by searching the following
terms: name:ubiquitin; NOT name:ase; NOT name:like;
NOT name:ribosomal; NOT name:modifier; NOT
name:factor; NOT name:protein; NOT name:conjugating;
NOT name:activating; NOT name:enzyme; AND
reviewed:yes; AND mnemonic:UB*.

Figure 6: The seven Lys binding residues of the ubiquitin protein family are highlighted in the APCs:
Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, and Lys63. Six of the seven binding sites, all except Lys29,
are discovered as conserved aligned columns with R1 = 1.0.

These adopted parameters help yield a reasonable 
number of input sequences for our study. From these 70 
input sequences, the PD Step was executed with the 
minimal order of 10, the minimum occurrence of 20, 
and the delta of 0.9 to yield a proper size of the results 
for the study. Table 8 shows the thirty discovered pat- 
terns, where all except five of the patterns contained the 
seven binding residues. Nevertheless, these patterns still 
corresponded to the conserved amino acids around the 
binding residues. Therefore, all the discovered patterns 
indicate important functionality in the ubiquitin protein 
family, such as the binding site or the areas next to the 
binding site. Once again, each pattern on its own occurs 
only a few times, and has only a low frequency count 
for representing the binding segments of this protein 
family. Since protein binding segments exhibit consider- 
able variability, APCs represent the protein family's 
functional binding sites more explicitly and effectively. 

From this list of 30 statistically significant patterns 
obtained from the previous PD Step, the APC Step was 
executed with the following settings: the MERGE Algo- 
rithm as Global Alignment, the SIMILARITY Score as 
Hamming Distance, the TERMINATION Condition Score 
less than 0.3, the heuristics column distribution score 
greater than 0.3 and the minimum of three overlapping 
column matches. We demonstrated the efficacy of our 
APC Process by showing the reduced set of 9 APCs and 
their binding sites (Table 9). 
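Since the APC Step is configured with Hamming Distance as its SIMILARITY Score, a minimal sketch of that distance for two equal-length patterns is given here. It does not reproduce the MERGE Algorithm or the TERMINATION Condition; the normalisation by pattern length and the function names are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def normalized_hamming(a, b):
    """Hamming distance divided by the pattern length (range 0 to 1)."""
    return hamming_distance(a, b) / len(a)


# Patterns ranked 27 and 30 in Table 8.
distance = hamming_distance("IENVK", "IDNVK")
fraction = normalized_hamming("IENVK", "IDNVK")
```

The two patterns differ in a single position (E versus D), so the distance is 1 and the normalised value is 0.2; these are the two patterns that appear together as I[ED]NVK in Table 9.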

Ubiquitin discussion 

The ubiquitin protein contains seven lysine residues,
Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, and Lys63, that
Table 8 Statistically Ranked Patterns Discovered from the Sequences of the ubiquitin Family
(… marks entries that are illegible in the source).

Ranking  Pattern                         Frequency  Score        Binding Residue
1        MQIFVKTLTGKTITLEVEPSDTIENVKAKI  21         5.44E+44     Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63
         QDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
         IQKESTLHLVLRLRGG
2        MQIFVKTLTGKTITLEVESSDTIDNVKAKI  15         2.86E+44     Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63
         QDKEGIPPDQQRLIFAGKQLEDGRTLADYN
         IQKESTLHLVLRLRGG
3        SDTIENVKAKIQDKEGIPPDQQRLIFAGKQ  21         1.25E+33     Lys27, Lys29, Lys33, Lys48, Lys63
         LEDGRTLSDYNIQKESTLHLVLRLRGG
4        SDTIDNVKAKIQDKEGIPPDQQRLIFAGKQ  17         7.59E+32     Lys27, Lys29, Lys33, Lys48, Lys63
         LEDGRTLADYNIQKESTLHLVLRLRGG
5        MQIFVKTLTGKTITLEVESSDTIDNVKAKI  17         4.76E+31     Lys6, Lys11, Lys27, Lys29, Lys33, Lys48
         QDKEGIPPDQQRLIFAGKQLEDGRTL
6        IENVKAKIQDKEGIPPDQQRLIFAGKQL    32         3.48E+31     Lys27, Lys29, Lys33, Lys48, Lys63
         EDGRTLSDYNIQKESTLHLVLRLRGG
7        VKTLTGKTITLEVESSDTIDNVKAKIQD    17         1.59E+30     Lys6, Lys11, Lys27, Lys29, Lys33, Lys48
         KEGIPPDQQRLIFAGKQLEDGRTLAD
8        TITLEVEPSDTIENVKAKIQDKEGIPPD    24         8.80E+28     Lys27, Lys29, Lys33, Lys48
         QQRLIFAGKQLEDGRTLSDYNI
9        KIQDKEGIPPDQQRLIFAGKQLEDGRTL    39         7.43E+27     Lys29, Lys33, Lys48, Lys63
         SDYNIQKESTLHLVLRLRGG
10       KEGIPPDQQRLIFAGKQLEDGRTLSDY     44         3.66E+23     Lys33, Lys48, Lys63
         NIQKESTLHLVLR
11       IPPDQQRLIFAGKQLEDGRTLADYNIQ     20         3.38E+23     Lys48, Lys63
         KESTLHLVLRLRGG
12       NVKAKIQDKEGIPPDQQRLIFAGKQLE     36         6.15E+21     Lys27, Lys29, Lys33, Lys48
         DGRTLSDYNI
13       KIQDKEGIPPDQQRLIFAGKQLEDGRT     44         5.20E+18     Lys29, Lys33, Lys48
14       …                               …          …            …
15       …                               …          …            …
16       …                               25         1.17E+15     Lys6, Lys11, Lys27
17       …                               15         …            …
18       …                               …          …            …
19       …                               …          …            …
20       …                               26         …            Lys6, Lys11
21       …                               26         …            Lys27
22       TITLEVEPS                       28         28304.96142
23       KTLTGKT                         67         3796.714675  Lys6, Lys11
24       DGRTLAD                         23         1298.702247
25       STLHL                           69         1102.599421
26       KTITL                           67         315.8836468  Lys11
27       IENVK                           38         309.1891137  Lys27
28       VEPSD                           28         260.0761993
29       TLADY                           23         191.1286116
30       IDNVK                           29         180.0682775  Lys27


can be linked to another ubiquitin to form a polyubiquitin
chain [35-37]. The seven binding residues are
visualized in the three-dimensional structure of the ubiquitin
protein (Figure 7). Our resulting APCs
correspond to six of the seven binding residues (Lys6,
Lys11, Lys27, Lys33, Lys48, and Lys63). The remaining
Lys29 is found in an APC with only one pattern and thus
stands out as a significant functional group with a distinct
pattern discovered with high statistical significance
in the PD Step.

Figure 7: The three-dimensional structure of the ubiquitin protein, with PDB ID 1UBQ from the protein
data bank, has seven binding residues: Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, and Lys63.

For ubiquitin, our APCs are pattern alignments that
agree with the emission probabilities of the pFam profile
HMM (Figure 6). All eight APCs discovered agreed with
the pFam HMM emission probability. Surprisingly, our
results differ from PROSITE's consensus motif
(PDOC00271), which missed 172 ubiquitin proteins. In
drug discovery, preventing the linking of ubiquitin to its
binding proteins via its binding site inhibits cancer growth.
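Because the APCs in Table 9 are written as regular expressions, they can be searched for directly in a sequence. The sketch below applies the shortest APC, I[ED]NVK, to the top-ranked 76-residue pattern from Table 8 using Python's standard `re` module; the variable names are illustrative, and reading the end of the match as a 1-based residue position is a convenience of this example.

```python
import re

# Top-ranked pattern from Table 8 (76 residues).
ubiquitin = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKI"
    "QDKEGIPPDQQRLIFAGKQLEDGRTLSDYN"
    "IQKESTLHLVLRLRGG"
)

# APC ranked ninth in Table 9, written as a regular expression.
apc = "I[ED]NVK"

match = re.search(apc, ubiquitin)

# match.end() is exclusive and 0-based, so it equals the 1-based
# position of the last matched residue (the lysine).
lysine_position = match.end()
```

The match is IENVK and its final lysine falls at position 27, the Lys27 binding residue listed for this APC.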



Table 9 The 9 APCs of the ubiquitin Family Ranked by Standard Residual (where m = the number of
patterns in the APC, and n = length of the APC).

Rank  APC (as regular expressions)  m    n    Quality  Coverage  Standard Residual  Binding Site
1     MQIFVKTLTGKTITLEVE[SP]S       10   76   0.31     61        4.7E+39            Lys6, Lys11, Lys27, Lys29, Lys33, Lys48, Lys63
      DTI[DE]NVKAKIQDKEGIPPDQ
      QRLIFAGKQLEDGRTL[SA]DYN
      IQKESTLHLVLRLRGG
2     NVKAKIQDKEGIPPDQQRLIFAG       5    52   0.5      67        3.3E+29            Lys27, Lys29, Lys33, Lys48, Lys63
      KQLEDGRTL[SA]DYNIQKESTL
      HLVLRLRGG
3     MQIFVKTLTGKTITLEVEP[SP]       5    27   0.34     67        2.7E+14            Lys6, Lys11, Lys27
      DTI[ED]NVK
4     DYNIQKESTLHLVLRLRGG           1    19   1        62        2.2E+12            Lys63
5     LEVE[SP]SDTIDNVK              2    13   0.31     54        1.0E+07            Lys27
6     KTITLEVEPS                    2    10   0.4      68        4.0E+05            Lys11, Lys27
7     DGRTLADY                      2    8    0.5      24        1.4E+04
8     STLHL                         1    5    1        69        1.7E+03
9     I[ED]NVK                      2    5    0.8      67        1.2E+03            Lys27



Conclusion 

Our APC Process greatly reduces the number of APCs in
comparison with other methods. This is due to the fact
that the APC Step starts with input patterns from the
PD Step rather than the entire input search space. This
drastically reduces the search space in a controlled manner.
From the application aspect, using data from two
Uniprot protein families (cytochrome c and ubiquitin),
the majority of top-ranking APCs corresponded to their
protein binding segments. The resulting cytochrome c
binding APCs agree with the pFam emission probability.
An APC represents a set of patterns as the horizontal
rows and its aligned columns as the vertical columns,
which can be further evaluated for amino acid conservations.
In fact, for cytochrome c, the proximal and distal
binding residues correspond to conserved aligned columns
with R1 of 1.0. In addition, the distal APC identifies
one conserved aligned column with R1 of 1.0 as the
binding residue, which is not identified in PROSITE or
pFam. The ubiquitin APCs also agree with the pFam emission
probability, and six of the seven binding residues are
successfully identified in the APCs.

In conclusion, APCs can be used to reveal functional 
domains across different protein families without relying 
on prior knowledge or clues about the consensus regions. 
Currently, we are using aligned column variations as 
amino acid characteristics to classify protein species and 
gene labels. We are also extending the algorithm to dis- 
cover interdependencies within APCs and long-distance 
associations among APCs. In more general cases of pro- 
tein analysis, the function and the nature of the protein 
function are not clear; thus, the capability that overcomes 
such difficulties marks the uniqueness and novelty of our 
APC Process. In the broader sense, this knowledge is 
essential for understanding the proteins involved in 



epigenetics for drug discovery [38]. The development of 
cancer generally increases with age, and with the ageing 
baby-boomer population it is crucial for drug companies 
to find cost-effective and time-saving techniques for drug 
discovery. 

Additional material 



Additional file 1: The glossary of terms and mathematical notations
to complement the definition in the Methodology section of this
paper.